Modular implementation of the linear- and cubic-scaling orbital minimization methods in electronic structure codes using atomic orbitals

We present a code modularization approach to design efficient and massively parallel cubic- and linear-scaling solvers for electronic structure calculations using atomic orbitals. The modular implementation of the orbital minimization method, in which linear algebra and parallelization issues are handled via external libraries, is demonstrated in the SIESTA code. The distributed block compressed sparse row (DBCSR) and scalable linear algebra package (ScaLAPACK) libraries are used for algebraic operations with sparse and dense matrices, respectively. The MatrixSwitch and libOMM libraries, recently developed within the Electronic Structure Library, facilitate switching between different matrix formats and implement the energy minimization. We show results comparing the performance of several cubic-scaling algorithms, and also demonstrate the parallel performance of the linear-scaling solvers, and their superiority over the cubic-scaling solvers for insulating systems with sizes of several hundred atoms.

Matrices are handled in the parallel-distributed dense block-cyclic (pddbc) and parallel-distributed compressed sparse row (pdcsr) MS formats, for which algebraic operations are carried out with the help of the ScaLAPACK [7] and DBCSR [24] libraries, respectively. Although basic functionality for sparse matrices was already provided in this recent MS version [9,30], a revision of the library was needed to treat sparse and dense matrices on the same footing and to enable linear-scaling calculations. The incorporation of the solver library into an electronic structure code also implies additional matrix manipulations, such as conversions between the matrix formats supported by the code and by the solver library, as well as reading and writing of restart files. The corresponding subroutines have been implemented here in MS and are discussed below.

Figure 1. The use of libraries within the revised orbital minimization method (OMM) solver in the electronic structure code SIESTA [11-15]. The red rectangular box corresponds to SIESTA. Blue ellipses indicate the libraries used [7,9,24,25,27-30]. The libraries in the dashed frame belong to the Electronic Structure Library (ESL) [9,31]. The arrows indicate calls to the libraries.
Listing 1. An example of the calculation of the total charge in the OMM approach using the MatrixSwitch library.
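The body of Listing 1 is not reproduced here; the following is a minimal sketch of what such a calculation can look like. It assumes the MatrixSwitch routines m_allocate(), mm_multiply(), m_trace() and m_deallocate() with the semantics indicated in the comments, and serial dense storage for simplicity; the exact interfaces should be taken from the MatrixSwitch documentation and the electronic supplementary material.

```fortran
! Sketch (not the original Listing 1): total charge in the OMM approach,
!   Q = Tr[rho*S] = 2*Tr[(2*I_W - S_W)*S_W],  with  S_W = C^dagger*S*C.
! Assumptions: serial dense storage ('sdden'); mm_multiply(A,opA,B,opB,C,alpha,beta)
! computes C := alpha*op(A)*op(B) + beta*C; m_trace(A,tr) returns tr := Tr[A].
subroutine omm_total_charge(m_basis, n_occ, C, S, Q)
  use MatrixSwitch
  implicit none
  integer, intent(in) :: m_basis, n_occ
  type(matrix), intent(inout) :: C, S          ! coefficients (m x n) and overlap (m x m)
  real(8), intent(out) :: Q
  type(matrix) :: SC, SW, SW2
  real(8) :: tr1, tr2

  call m_allocate(SC, m_basis, n_occ, 'sdden')   ! S*C
  call m_allocate(SW, n_occ, n_occ, 'sdden')     ! S_W = C^dagger*S*C
  call m_allocate(SW2, n_occ, n_occ, 'sdden')    ! S_W*S_W

  call mm_multiply(S, 'n', C, 'n', SC, 1.0d0, 0.0d0)
  call mm_multiply(C, 'c', SC, 'n', SW, 1.0d0, 0.0d0)
  call mm_multiply(SW, 'n', SW, 'n', SW2, 1.0d0, 0.0d0)
  call m_trace(SW, tr1)
  call m_trace(SW2, tr2)
  Q = 2.0d0*(2.0d0*tr1 - tr2)                    ! Q = 2*Tr[(2*I_W - S_W)*S_W]

  call m_deallocate(SW2); call m_deallocate(SW); call m_deallocate(SC)
end subroutine omm_total_charge
```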
After a brief overview of the OMM approaches, the new implementation of linear- and cubic-scaling OMM in SIESTA is presented, including the necessary changes in the MS and libOMM libraries forming part of ESL. The results of the first tests are discussed, and recommendations on the efficient use of OMM are given.

Overview of orbital minimization method approaches
In density functional theory (DFT) [35,36], the problem of finding the ground state of a many-electron system is reduced to an energy minimization for a system of 2n non-interacting electrons moving in an effective potential and described by one-particle states {|ψ_i⟩} (i = 1, …, n), each of which is occupied by two electrons of opposite spin (assuming no spin polarization, for simplicity). The set of states {|ψ_i⟩} is one of the many possible bases of the occupied subspace of the Hilbert space of the system and can be chosen orthonormal or not. In the latter case [37], the overlap matrix S_W with elements (S_W)_{ij} = ⟨ψ_i|ψ_j⟩ is not the identity matrix (S_W ≠ I_W, where (I_W)_{ij} = δ_{ij}), and the density matrix operator that determines the projection onto the occupied subspace is then given by

$$\hat{\rho} = 2\sum_{i,j=1}^{n} |\psi_i\rangle \big(S_W^{-1}\big)_{ij} \langle\psi_j|, \qquad (2.1)$$

involving the inverse of S_W [17,18,22]. Here and below, we limit our consideration to insulating systems. Linear-scaling methods applicable to metals are discussed, e.g., in [38-41].
The corresponding band structure energy becomes [16-18,22]

$$E = \mathrm{Tr}\big[\hat{\rho}\,\hat{H}\big] = 2\,\mathrm{Tr}\big[S_W^{-1} H_W\big], \qquad (2.2)$$

where Ĥ is the Hamiltonian operator and H_W is the matrix with elements (H_W)_{ij} = ⟨ψ_i|Ĥ|ψ_j⟩. Note that the traces in equation (2.2) are taken on spaces of different dimensions: the size of the basis set for the first, and the number of occupied states for the second. Also, the second equality holds for zero temperature.
In the basis of m functions {|ϕ_μ⟩} (strictly localized atomic orbitals in SIESTA), the occupied states are expanded as

$$|\psi_i\rangle = \sum_{\mu=1}^{m} C_{\mu i}\,|\phi_\mu\rangle, \qquad (2.3)$$

where we refer to C as the coefficient matrix. Then H_W = C†HC and S_W = C†SC, where H_{μν} = ⟨ϕ_μ|Ĥ|ϕ_ν⟩, S_{μν} = ⟨ϕ_μ|ϕ_ν⟩ and C† is the Hermitian conjugate of C. The energy functional in equation (2.2) is minimized to find the ground state energy. The most common approach is direct diagonalization of the Hamiltonian matrix H (an m × m matrix for a basis set of size m). The energy and charge density are then obtained using the wave functions and energies of the n lowest eigenstates. By contrast, in iterative approaches [42], the energy is minimized with respect to variations in the states {|ψ_i⟩}. Here one needs to calculate the inverse of the overlap matrix, S_W^{-1}, or impose the orthonormality condition (S_W)_{ij} = δ_{ij}. In either case, the computational time increases as O(n³) with the system size, while the memory required to store the wave functions grows as O(n²). In OMM approaches [16-18,22], the expensive orthonormalization step is avoided by modifying the energy functional in such a way that it automatically induces the orthonormalization of the wave functions during minimization:

$$\tilde{E} = 2\,\mathrm{Tr}\big[(2 I_W - S_W)\,H_W\big]. \qquad (2.4)$$

This expression can be derived from consideration of Lagrange multipliers [16,17] or from the expansion of the inverse overlap matrix to first order in the deviation from the identity [18,22]:

$$S_W^{-1} = \big[I_W + (S_W - I_W)\big]^{-1} \approx 2 I_W - S_W.$$

The solution obtained from equation (2.4) is the same as from equation (2.2). Within the same approximation, the density matrix of equation (2.1) is computed as [17]

$$\rho = 2 C\big(I_W + (I_W - S_W)\big)C^{\dagger} = 2 C\big(2 I_W - C^{\dagger} S C\big)C^{\dagger} \qquad (2.5)$$

and the forces on atom I as [17]

$$\mathbf{F}_I = -\mathrm{Tr}\bigg[\rho\,\frac{\partial H}{\partial \mathbf{R}_I}\bigg] + \mathrm{Tr}\bigg[\rho_E\,\frac{\partial S}{\partial \mathbf{R}_I}\bigg], \qquad (2.6)$$

where we refer to ρ_E = 2CH_WC† as the 'energy density'.
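For concreteness, the following self-contained sketch evaluates the functional of equation (2.4) and the density matrix of equation (2.5) with plain Fortran intrinsics (no MatrixSwitch or libOMM calls), assuming real matrices at the Γ point; all names are illustrative.

```fortran
! Sketch (dense, real, Gamma point): evaluate the OMM functional E~ of eq. (2.4)
! and the density matrix rho of eq. (2.5) from H, S (m x m) and C (m x n).
subroutine omm_energy_density(m, n, H, S, C, E_omm, rho)
  implicit none
  integer, intent(in) :: m, n
  real(8), intent(in) :: H(m,m), S(m,m), C(m,n)
  real(8), intent(out) :: E_omm, rho(m,m)
  real(8) :: HW(n,n), SW(n,n), W(n,n)
  integer :: i

  HW = matmul(transpose(C), matmul(H, C))     ! H_W = C^T H C
  SW = matmul(transpose(C), matmul(S, C))     ! S_W = C^T S C
  W  = -SW                                    ! W = 2 I_W - S_W
  do i = 1, n
     W(i,i) = W(i,i) + 2.0d0
  end do

  E_omm = 0.0d0                               ! E~ = 2 Tr[(2 I_W - S_W) H_W]
  do i = 1, n
     E_omm = E_omm + 2.0d0*dot_product(W(i,:), HW(:,i))
  end do

  rho = 2.0d0*matmul(C, matmul(W, transpose(C)))   ! rho = 2 C (2 I_W - S_W) C^T
end subroutine omm_energy_density
```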
If the basis functions and the wave functions are chosen to be strictly localized, the Hamiltonian, overlap and coefficient matrices, H, S and C, are sparse, and O(n) scaling with system size is achieved [16-18,22]. Note that this is not the case for equations (2.1) and (2.2), as the inverse of S is not sparse (although subcubic scaling can be achieved using selected inversion to compute just the needed elements of the inverse [43]). In the case of periodic systems, localized wave functions are close to the Wannier functions, which decay exponentially with the distance from the centre of localization in insulators and in metals at a finite temperature. Imposing localization constraints on the wave functions, however, leads to a deviation from the exact solution of equations (2.2) and (2.4). Also, the localized wave functions obtained are not strictly orthonormal and do not comply with the system symmetries [23]. However, the degree of approximation can be controlled with the cut-off radius R_C for the wave functions. Here, we limit our consideration to insulators with a substantial band gap, where an R_C of several Å is normally enough [17,18].
In the Ordejón-Mauri functional [16-18,22], the localization of the wave functions gives rise to many shallow local minima and flat regions in which the algorithm can be trapped for a long time during the energy minimization. This problem is solved in the Kim functional [23] by including unoccupied states and introducing a chemical potential η, i.e. the energy separating occupied and unoccupied states. The corresponding functional is obtained by (i) an eigenspectrum shift H → H − ηS, (ii) changing the dimensions of C from m × n to m × n′, where n′ > n, and (iii) changing the energy functional in equation (2.4) as Ẽ → Ẽ + ηn, and the energy density ρ_E in equation (2.6) as ρ_E → ρ_E + ηρ. It should be noted, however, that although the multiple-minima problem is solved in the Kim functional, it is sometimes hard to choose a proper value for η. It should always lie within the band gap, but the bands can move up and down during self-consistency or molecular dynamics (MD), so that η may end up inside the valence or conduction bands, leading to convergence to an erroneous solution.
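A sketch of the Kim-functional modifications, built on the routine above (same assumptions: real matrices, Γ point, illustrative names). Note that, with the spin factor of 2 used in these sketches, the compensating term added back to the energy is η times the number of electrons, 2ηn.

```fortran
! Sketch: Kim-functional modifications on top of omm_energy_density() from the
! previous sketch. (i) the spectrum is shifted, H -> H - eta*S; (ii) C is enlarged
! to m x n_states with n_states > n_occ; (iii) a compensating term proportional to
! eta is added back (2*eta*n_occ with the conventions used in these sketches).
subroutine kim_energy(m, n_occ, n_states, H, S, C, eta, E_kim, rho)
  implicit none
  integer, intent(in) :: m, n_occ, n_states
  real(8), intent(in) :: H(m,m), S(m,m), C(m,n_states), eta
  real(8), intent(out) :: E_kim, rho(m,m)

  call omm_energy_density(m, n_states, H - eta*S, S, C, E_kim, rho)
  E_kim = E_kim + 2.0d0*eta*n_occ
end subroutine kim_energy
```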
Care should be taken to ensure that the solution reproduces the correct number 2n of electrons. If the localization constraints on the wave functions are removed, the exact solution of equations (2.2) and (2.4) is obtained [28]. Even in this case, however, a single energy minimization can demand many conjugate-gradient (CG) iterations. This relates to the problem of length-scale or kinetic-energy ill-conditioning [42,44]. The efficiency of the CG algorithm depends on the ratio of the maximal and minimal curvatures of the function minimized, which in OMM are determined by the maximal and minimal eigenvalues of the Hamiltonian. The eigenspectrum of the Hamiltonian is broad, given the large kinetic energy of high-energy eigenstates. Although such states contribute negligibly to the ground-state solution, the problem becomes ill-conditioned and the convergence is slow. It is, however, possible to effectively reduce the width of the eigenspectrum by suppressing the kinetic-energy contribution of high-energy states through preconditioning [45,46], by which the CG gradient matrix is multiplied by the preconditioning matrix [28]

$$P = \big(S + \tau_T\, T\big)^{-1},$$

where τ_T is the scale for kinetic energy preconditioning and T is the kinetic energy matrix. Another approach for improving the efficiency of CG minimizations is to reduce the generalized eigenvalue problem to the standard form via the Cholesky factorization [28]. Both of these approaches involve matrices that are not sparse (the preconditioning matrix or the reduced Hamiltonian) and are considered here only for cubic-scaling OMM.
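The following dense sketch applies such a preconditioner to a gradient matrix by solving a linear system with the LAPACK routine dposv instead of forming the inverse explicitly (illustrative names; the actual libOMM implementation may differ).

```fortran
! Sketch: kinetic-energy preconditioning of the CG gradient (dense, real case).
! Rather than forming P = (S + tau_T*T)^(-1) explicitly, the preconditioned
! gradient P*G is obtained by solving (S + tau_T*T) * Gp = G with LAPACK.
subroutine precondition_gradient(m, n, S, T, tau_T, G)
  implicit none
  integer, intent(in) :: m, n
  real(8), intent(in) :: S(m,m), T(m,m), tau_T
  real(8), intent(inout) :: G(m,n)      ! gradient, overwritten by P*G
  real(8) :: A(m,m)
  integer :: info
  external :: dposv                     ! LAPACK solver for symmetric positive-definite systems

  A = S + tau_T*T                       ! S + tau_T*T is symmetric positive definite
  call dposv('U', m, n, A, m, G, m, info)
  if (info /= 0) stop 'precondition_gradient: dposv failed'
end subroutine precondition_gradient
```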

Solver input and output
A scheme of the implemented OMM solver is shown in figure 2. At each self-consistent-field (SCF) step, the solver receives as input the Hamiltonian and overlap matrices in the basis of strictly localized atomic orbitals, H and S, together with information on the system geometry. SIESTA stores matrices in the standard compressed sparse row format, i.e. the matrix information is kept in local one-dimensional arrays containing the values and column indices of the individual non-zero elements of the local rows, as well as the indices of the first non-zero element and the number of non-zero elements of each local row. The blocks of rows are distributed on a one-dimensional process grid (figure 4a). Here and in MS we refer to this format as pdrow, to distinguish it from the pdcsr format supported by DBCSR. H and S are received by the solver in the pdrow format. The density matrix ρ is the output, also in pdrow (see equation (2.5)). This matrix is used to update H for the next SCF step outside the solver. At the end of each MD step, the solver is called again to compute the energy density matrix ρ_E, which, along with ρ, is later used to calculate forces (see equation (2.6)) and stresses. The scheme of the ρ_E calculation is analogous to that of ρ shown in figure 2.
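The data layout described above can be summarized by the following illustrative derived type; the names are hypothetical and do not correspond to actual SIESTA or MS data structures. Each process holds one block of rows of the global matrix.

```fortran
! Sketch of the per-process data for a matrix in the pdrow format (1D block-row
! distribution of compressed-sparse-row data), following the description in the text.
type :: pdrow_matrix
   integer :: nrows_glob                 ! global number of rows
   integer :: nrows_loc                  ! rows held by this process (one block of rows)
   integer :: first_row                  ! global index of the first local row
   integer,  allocatable :: nnz_row(:)   ! number of non-zero elements in each local row
   integer,  allocatable :: row_start(:) ! index of the first non-zero element of each local row
   integer,  allocatable :: col_ind(:)   ! global column indices of the local non-zero elements
   real(8),  allocatable :: val(:)       ! values of the local non-zero elements
end type pdrow_matrix
```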

Solver library
The solver uses the libOMM library [9,27,28,32] to perform the CG minimization of the energy functional given by equation (2.4). As input, the libOMM library requires H and S, as well as the initial guess for C†, in one of the MS formats [9,29,30,32]. As output, it provides the converged C†, and ρ or ρ_E in the same format. The pddbc format is used for parallel calculations with dense matrices. In this case, all matrix elements are stored and the algebraic operations are performed using the ScaLAPACK library [7]. The matrix is divided into two-dimensional blocks distributed on a two-dimensional or one-dimensional process grid. For parallel calculations with sparse matrices, the pdcsr format is used. The matrix is also divided into two-dimensional blocks distributed on a one-dimensional or two-dimensional process grid (figure 4b,c, respectively). However, in this case, zero blocks are not stored. The algebraic operations are performed by the DBCSR library [24-26]. At the moment, libOMM supports only equal rectangular blocks. The equations implemented in the libOMM library are compatible with all OMM flavours discussed in the previous section, including the Ordejón-Mauri and Kim functionals, with and without localization constraints. However, to make the libOMM library functional for sparse matrices, some parts of the code have been reformulated. Block-size information is now passed to the MS library during the allocation of the intermediate matrices required for the CG minimization using m_allocate() (see electronic supplementary material). Also, sparsity is imposed on the gradient matrix G (with elements G_{μi} = ∂Ẽ/∂C*_{μi}) following the sparsity pattern of the initial guess for C. Already during the calculation of G [28], only matrix elements that fit into the sparsity pattern are computed in those contributions to G that are given by products of matrices (using the keep_sparsity = true option of mm_multiply()). In the remaining contributions, non-zero elements that do not fit into the sparsity pattern are omitted and no longer stored, while zero elements within the sparsity pattern are stored as zeros. The sparsity of the density (ρ) and energy density (ρ_E) matrices is assumed to be the same as that of the overlap matrix S, and only elements of these matrices that fit into this sparsity pattern are computed. Additionally, the expression for the calculation of ρ_E has been corrected with respect to the previous libOMM version [28], in accordance with equations (2.4)-(2.6) and [17]. The Cholesky factorization and kinetic energy preconditioning are available only for dense matrices.
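As an illustration of how the sparsity of G can be preserved, the sketch below builds the gradient of equation (2.4), G = 4HC − 2SCH_W − 2HCS_W (with G_{μi} = ∂Ẽ/∂C*_{μi} as defined in the text), from MatrixSwitch product operations. It assumes real matrices, that C is handled as an m × n matrix (the code actually works with C†), the conventional C := α·op(A)·op(B) + β·C semantics for mm_multiply(), and that keep_sparsity is an optional logical argument, as mentioned in the text; the exact interfaces are given in the electronic supplementary material.

```fortran
! Sketch: gradient of eq. (2.4) with the sparsity pattern of G (inherited from the
! initial guess for C) imposed on the product-type contributions.
subroutine omm_gradient_sparse(H, S, C, HW, SW, HC, SC, G)
  use MatrixSwitch
  implicit none
  type(matrix), intent(inout) :: H, S, C      ! inputs (m x m, m x m, m x n)
  type(matrix), intent(inout) :: HW, SW       ! work: H_W = C^T H C, S_W = C^T S C (n x n)
  type(matrix), intent(inout) :: HC, SC       ! work: H*C and S*C (m x n)
  type(matrix), intent(inout) :: G            ! gradient, sparsity pattern of the initial C

  call mm_multiply(H, 'n', C, 'n', HC, 1.0d0, 0.0d0)
  call mm_multiply(S, 'n', C, 'n', SC, 1.0d0, 0.0d0)
  call mm_multiply(C, 't', HC, 'n', HW, 1.0d0, 0.0d0)
  call mm_multiply(C, 't', SC, 'n', SW, 1.0d0, 0.0d0)

  ! Only elements inside the pattern of G are produced in these contributions.
  call mm_multiply(H,  'n', C,  'n', G,  4.0d0, 0.0d0, keep_sparsity=.true.)
  call mm_multiply(SC, 'n', HW, 'n', G, -2.0d0, 1.0d0, keep_sparsity=.true.)
  call mm_multiply(HC, 'n', SW, 'n', G, -2.0d0, 1.0d0, keep_sparsity=.true.)
end subroutine omm_gradient_sparse
```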

C y matrix format conversion
In order to incorporate the libOMM library into SIESTA within the OMM solver, the following steps are required (figure 2): (1) matrix format conversion from/to the SIESTA format to/from the MS formats and (2) initialization and update of C†, according to the current geometry of the system. The matrix format conversion is realized using calls to the MS subroutines m_register_pdrow() and m_copy() (see electronic supplementary material). The first of these subroutines has been added to the MS library and the second one has been extended to allow conversion from/to the pdrow format to/from the pdcsr and pddbc formats. The conversion is performed as follows (figure 3). First, the pointers to the arrays of the pdrow matrix and its block size are passed to MS. Then a pdcsr/pddbc matrix distributed on a one-dimensional process grid, with the same row block size as the initial pdrow matrix, is filled in element by element (figure 4). The missing elements of the pddbc matrix, or those within non-zero blocks of the pdcsr matrix, are filled with zeros. Note that, to speed up the conversion and guarantee linear scaling, the column and row indices of the non-zero blocks of the pdcsr matrix should be passed to the DBCSR library before the values are filled in, via a call to m_reserve_blocks() (see electronic supplementary material). Once the one-dimensional-distributed pdcsr/pddbc matrix is ready, it can be redistributed on a two-dimensional process grid. In the case when the final matrix is distributed on a one-dimensional process grid and has the same row block size as the initial pdrow matrix, the last step is omitted. The conversion from pdcsr and pddbc to pdrow is implemented in a similar way. It is assumed that the row and column indices of the non-zero elements of the pdrow matrix are already known; only the values of the matrix elements are restored.
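The block-reservation step can be illustrated with the following plain-Fortran sketch, which records which (row-block, column-block) pairs of the target pdcsr matrix contain non-zeros for one local pdrow block of rows. The names are hypothetical and square blocks of size bs are assumed; the actual reservation is done through m_reserve_blocks().

```fortran
! Sketch: determine which blocks of the target pdcsr matrix contain non-zeros,
! given one local block of rows in CSR form.
subroutine list_nonzero_blocks(nrows_local, first_row, row_ptr, col_ind, bs, block_is_nonzero)
  implicit none
  integer, intent(in) :: nrows_local, first_row, bs
  integer, intent(in) :: row_ptr(nrows_local+1)     ! first non-zero of each local row
  integer, intent(in) :: col_ind(*)                 ! global column indices of non-zeros
  logical, intent(inout) :: block_is_nonzero(:,:)   ! (global row blocks, column blocks)
  integer :: i, k, ib, jb

  do i = 1, nrows_local
     ib = (first_row + i - 2)/bs + 1                ! global row-block index
     do k = row_ptr(i), row_ptr(i+1) - 1
        jb = (col_ind(k) - 1)/bs + 1                ! global column-block index
        block_is_nonzero(ib, jb) = .true.
     end do
  end do
end subroutine list_nonzero_blocks
```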

C y matrix initialization and update
The initialization of the C† matrix in sparse form is performed in SIESTA in the following way. Each atom is supposed to carry a number of localized wave functions equal to its atomic charge (in units of the elementary charge) divided by two, Q_at/2. If Q_at is odd, (Q_at + 1)/2 localized wave functions are assigned to one atom and (Q_at − 1)/2 to the next one. This procedure is repeated for all the atoms in the system. Then the C† matrix is prepared in the pdrow format, with the total number of rows corresponding to the total number of localized wave functions, N_WF = Q/2, where Q is the sum of the atomic charges in the system. The local rows are assigned according to the block size b_WF. By default, it equals the block size for the basis functions, b_BF, multiplied by the ratio of the total number N_WF of localized wave functions to the basis set size N_BF: b_WF = b_BF N_WF/N_BF. For each local row, the local environment of the atom hosting the corresponding localized wave function is analysed. The row elements that correspond to atoms beyond some cut-off radius R_C from the atom considered are taken to be zero. The row elements that correspond to atoms within the cut-off radius R_C are initialized with random values. This sparsity pattern is maintained during the minimization of the energy functional. The C† matrix in the pdrow format is converted to the pdcsr or pddbc format in the same manner as the Hamiltonian and overlap matrices, H and S.
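The initialization logic for one localized wave function (one row of C†) can be sketched as follows, with hypothetical helper arrays standing in for SIESTA's internal structures; periodic images are ignored in the sketch.

```fortran
! Sketch: build one row of C^dagger. Elements on atoms within r_cut of the host
! atom receive random values; all other elements are left out of the sparsity pattern.
subroutine init_wf_row(m_basis, host_atom, atom_of_orbital, xyz, r_cut, nnz, cols, vals)
  implicit none
  integer, intent(in) :: m_basis, host_atom, atom_of_orbital(m_basis)
  real(8), intent(in) :: xyz(3,*), r_cut
  integer, intent(out) :: nnz, cols(m_basis)
  real(8), intent(out) :: vals(m_basis)
  integer :: mu
  real(8) :: d(3), r

  nnz = 0
  do mu = 1, m_basis
     d = xyz(:, atom_of_orbital(mu)) - xyz(:, host_atom)
     r = sqrt(sum(d*d))
     if (r <= r_cut) then
        nnz = nnz + 1
        cols(nnz) = mu
        call random_number(vals(nnz))         ! random initial guess within the pattern
        vals(nnz) = vals(nnz) - 0.5d0
     end if
  end do
end subroutine init_wf_row
```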
It should be mentioned also that the initial cut-off radius R C,ini for initialization of the C y matrix can be set different from R C used for the energy minimization. Choosing a small initial radius R C,ini (several Å) helps to avoid convergence problems and is useful not only in calculations with sparse matrices but also with dense ones.
At each new MD step, the sparsity pattern of the C† matrix is checked again. The elements that should now be zero, because the corresponding atoms have moved farther away than R_C, are set to zero and no longer stored. The elements corresponding to atoms that have come closer than R_C are now stored and treated as non-zero, but are set to zero as the initial guess. Linear extrapolation of the C† matrix based on the information from the two previous MD steps is also possible.
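A minimal sketch of such an extrapolation, assuming equal MD time steps and hypothetical array names; it acts on the values stored within the current sparsity pattern, with elements newly added to the pattern initialized to zero beforehand.

```fortran
! Sketch: linear extrapolation of the stored coefficients from the two previous MD steps.
subroutine extrapolate_coeffs(nnz, c_prev, c_prev2, c_guess)
  implicit none
  integer, intent(in) :: nnz
  real(8), intent(in) :: c_prev(nnz), c_prev2(nnz)   ! values at steps t-dt and t-2*dt
  real(8), intent(out) :: c_guess(nnz)               ! initial guess at step t
  c_guess = 2.0d0*c_prev - c_prev2
end subroutine extrapolate_coeffs
```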

C y matrix input and output
The restart file for the C† matrix can be written at each SCF step and read at the beginning of the run. These operations are performed by calling the new MS subroutines m_write() and m_read(), respectively (see electronic supplementary material). If the C† matrix in the pdcsr or pddbc format is distributed on a two-dimensional process grid, it is first converted into a one-dimensional-distributed matrix (by analogy with the format conversion routines). Then the blocks of rows are consecutively passed to the head core and written to the file. To read the file, the reverse operations are performed. The block sizes and process grid for the C† matrix do not need to be the same as those used when writing the restart information. Upon reading, the sparsity pattern of the C† matrix is corrected according to the current system geometry.
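The gather-and-write step can be sketched with plain MPI as follows (hypothetical names; values only, and equal block sizes on all processes are assumed for simplicity, whereas the actual implementation handles general block sizes).

```fortran
! Sketch: write a 1D-row-distributed matrix to a single restart file by passing
! each process's block of rows to the head core in turn.
subroutine write_restart_rows(my_block, nloc, m, comm)
  use mpi
  implicit none
  integer, intent(in) :: nloc, m, comm
  real(8), intent(in) :: my_block(nloc, m)
  real(8), allocatable :: buf(:,:)
  integer :: rank, nprocs, p, ierr, iu

  call mpi_comm_rank(comm, rank, ierr)
  call mpi_comm_size(comm, nprocs, ierr)
  if (rank == 0) then
     open(newunit=iu, file='C.restart', form='unformatted', action='write')
     write(iu) my_block                     ! head core writes its own rows first
     allocate(buf(nloc, m))
     do p = 1, nprocs-1                     ! then receives and writes the other blocks in order
        call mpi_recv(buf, nloc*m, mpi_double_precision, p, 0, comm, mpi_status_ignore, ierr)
        write(iu) buf
     end do
     close(iu)
  else
     call mpi_send(my_block, nloc*m, mpi_double_precision, 0, 0, comm, ierr)
  end if
end subroutine write_restart_rows
```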

SIESTA input parameters
The input parameters for SIESTA corresponding to the revised OMM solver are described in table 1; for example, OMM.Cholesky (default false) controls whether the Cholesky factorization [28] is applied.

The test system is single-layer BN. The lattice constant is set at 2.48 Å and the height of the simulation cell is 20 Å. The calculations have been performed at the Γ point only. The local density approximation [47], norm-conserving Troullier-Martins pseudopotentials [48] and the standard built-in double-zeta polarized (DZP) basis set [49] are used. The atomic orbitals are set to zero beyond the cut-off determined by an energy shift of 10 meV (cut-off radii of 2.5-4.5 Å). The real-space grid is equivalent to a plane-wave cut-off energy of 100 Ry. A linear mixing scheme with a mixing parameter of 0.1 is applied to converge the ground state. The tolerance is 10⁻⁴ for the density matrix and 10⁻³ eV for the matrix elements of the Hamiltonian.
To test the performance of the different approaches in MD simulations, several MD steps starting from the converged ground state have been computed (the ground state is first converged with the same method as used for the MD). The microcanonical ensemble with an initial temperature of 300 K is considered. The Verlet algorithm [50] with a time step of 1 fs is used. The Pulay mixing scheme [51] with a mixing parameter of 0.2 is applied during the MD simulations.
The matrices involved in the calculations consist of equal blocks. For the DZP basis set, each boron and nitrogen atom has 13 basis functions and hosts three or two localized wave functions, depending on whether or not the unoccupied states are included. Therefore, the block size for the wave functions is usually chosen to be b_WF = 6 and that for the basis functions b_BF = 13. The matrices are distributed on a two-dimensional process grid. The cut-off radius for the localized wave functions in typical calculations with sparse matrices is R_C = 4 Å. The chemical potential for the Kim functional is η = −5.5 eV. CG iterations are performed until the difference of the energies at consecutive CG iterations divided by their average drops below 10⁻⁹. The tests with preconditioning for dense matrices have been carried out using a kinetic-energy scale of τ_T = 10 Ry [28].

Results
To compare the performance of diagonalization and OMM with dense and sparse matrices, we have performed test MD simulations for single-layer BN of different sizes. Figure 5 demonstrates that the approaches in which the wave functions are not confined in space have much worse scaling with system size than the methods with wave functions localized within a cut-off radius R_C. The scaling of the former approaches is close to cubic for large systems (exceeding 1000 atoms in our calculations). It should be noted, however, that for small systems (below 1000 atoms) the scaling is sub-cubic. The reason is that, for such systems, the solver contribution to the total time plotted in figure 5 is comparable to the contributions of other parts of the code that scale linearly with system size. Among the methods using dense matrices, OMM with preconditioning or Cholesky factorization, which improve the convergence, shows slightly better scaling compared with diagonalization or plain OMM. Also, OMM using the DBCSR library with no localization of the wave functions (R_C → ∞) clearly has better scaling than OMM using ScaLAPACK. This is explained by the fact that the former, although having a dense coefficient matrix, still exploits the sparsity of the Hamiltonian and overlap matrices. In the range of system sizes considered, OMM with kinetic energy preconditioning is the fastest among the approaches without wave-function localization, followed by OMM with the Cholesky factorization, diagonalization and plain OMM (figure 5). The crossover between preconditioned dense OMM and the linear-scaling methods takes place for systems of about 1200 atoms. For plain dense OMM and for diagonalization, the crossovers with the linear-scaling methods occur earlier, at about 300 and 700 atoms, respectively.
Our timings for single-layer BN have confirmed that the Ordejón-Mauri and Kim approaches in which the wave functions are localized within a cut-off radius R_C show linear scaling with system size (figure 6a). The computational times corresponding to the different parts of the solver (matrix conversion, the libOMM library, initialization and update of the coefficient matrix, reading and writing of the restart for localized wave functions) and to other parts of the SIESTA code, such as the subroutine for the Hamiltonian update called after the density matrix change at each SCF step (DHSCF), all scale linearly with system size. As a result, the relative contributions of the different parts of the code do not depend on the system size (figure 6b). This is different from the cubic-scaling methods, in which the solver quickly comes to dominate the computing time as the system size increases, since the rest of the code scales linearly. It should also be noted that, for the system considered, the solver takes only 40-50% of the computational time, comparable, for example, to the subroutine for the Hamiltonian update (DHSCF in SIESTA). Most of this time is spent in the libOMM library, whose performance depends on the block sizes chosen for the basis functions and the wave functions. The computational time grows upon increasing the basis-function block size (note that the growth continues beyond the block sizes shown in figure 7a) and has its minimum at b_BF = 13. The wave-function block-size dependence reaches its minimum at b_WF = 6-10. At small b_WF, a fast growth of the computational time is observed; it can be attributed to the increase in the number of non-empty blocks to be handled upon decreasing the block size. At large b_WF, the computational time also grows, but at a slower rate; this can be explained by the increasing number of matrix elements that are stored and explicitly involved in the matrix operations. Therefore, we find optimal block sizes of the order of 10 both for the wave functions and for the basis functions. Furthermore, chemical considerations can be exploited when dividing the matrices into blocks. Still, the optimal choice of block sizes for complex systems is not straightforward and requires further investigation [26].
The CPU scaling of the libOMM solver library in calculations with sparse matrices using DBCSR is shown in figure 8a. A similar CPU scaling is observed for systems of different sizes (figure 8a) and for different block and basis set sizes. The computational time decreases by a factor of about 2.5 upon doubling the computational cost. Such a speed-up is observed for the CG energy minimization and the subsequent calculation of ρ. It should be noted, however, that the calls to libOMM for the calculation of ρ_E, which involve only two matrix multiplication operations, show much better CPU scaling. This can be appreciated from the twice steeper slope of computational cost versus computational time as compared with the calls for the energy minimization and the calculation of the density matrix (figure 8b). It can, therefore, be expected that the solver parallelization might be further improved via proper code refactoring. The use of OpenMP, GPUs and the library for small matrix multiplications (LIBXSMM) [52] is known to lead to superior DBCSR performance [25,26], which also requires investigation.

Recommendations for orbital minimization method solver use
The new modular implementation of the OMM solver makes it easier to disentangle technical problems, e.g. in the parallelization, from drawbacks of the OMM itself. Here, we present the first implementation of the solver using external libraries, which represents the starting point for further performance improvements and method polishing. Ways to improve the solver performance were mentioned in the previous subsection. We now briefly discuss the drawbacks of the OMM and how they can be addressed. One of the most important methodological problems of the OMM approach lies in the minimization, which can require a large number of CG iterations. As shown in figure 9, the first SCF iteration from scratch is rather costly both for the linear- and the cubic-scaling OMM. For the linear-scaling methods, the first SCF iteration can include thousands of CG steps, followed by tens of SCF iterations with hundreds of CG steps each. After that, each SCF step needs just a few CG iterations, becoming very fast. It should be noted that, except for the very first SCF iterations, the linear-scaling and plain cubic-scaling OMM require roughly the same number of CG steps. However, kinetic energy preconditioning or the Cholesky factorization significantly reduces the number of CG iterations required, with a considerable reduction of the computational time (see also figure 5). Therefore, it is always recommended to use one of these two ways of dealing with kinetic-energy ill-conditioning in dense OMM. The extension of these approaches to sparse matrices is not straightforward and requires further investigation.
When starting from scratch, one can also get into regions of parameter space where the energy functional has no minimum along the CG line minimization. To avoid this situation, we recommend using a small cut-off radius R_C,ini for the initial guess of the wave functions, both for linear- and cubic-scaling OMM. It is also recommended to pre-converge the ground state using a small linear-mixing parameter; starting from values as low as 0.01 can be required for very large systems. The mixing parameter can then be gradually increased to normal values of 0.1-0.2. After getting close to the ground state, other mixing schemes can be used. If the geometry of the system is far from the optimal one, a reduced step for geometry optimization may also be needed at the start.
In figure 10, we address the accuracy of force and energy calculations with the Ordejón-Mauri and Kim functionals for BN. The deviation from the results for wave functions without localization (R_C → ∞) is plotted for different cut-off radii R_C. It is seen that, for both functionals, the accuracy improves upon increasing the cut-off radius in a similar manner. Deviations of the energy and forces within 0.01 eV atom⁻¹ and 0.02 eV Å⁻¹ are already achieved for a cut-off radius of R_C = 4 Å. These results confirm that, for insulating systems with a substantial band gap, it is sufficient to consider cut-off radii of several Å [17,18]. The Ordejón-Mauri and Kim functionals were designed for insulating systems with a substantial gap. For metals, a smearing function needs to be introduced. However, this is not easy, since the information on individual Kohn-Sham eigenstates is missing in OMM. An idea for combining OMM with another method that resolves the eigenstates close to the Fermi level was proposed in [28] but still requires exploration. Note that modelling of metallic systems requires a much more significant computational effort than modelling of insulators [38,39,41].
As for magnetic systems, the OMM calculations can be performed taking into account spin polarization. At each SCF step, the coefficient matrices for spin up and spin down are found sequentially. All the observations for non-spin-polarized systems discussed above still hold in this case.

Conclusion
We have demonstrated how modularization simplifies the implementation of new solvers in electronic structure codes by revising the OMM solver in the SIESTA code [11-15]. Matrix algebra operations and parallelization are efficiently handled via external libraries. In particular, the implementation benefits from two ESL [9,31] libraries: libOMM [9,27,28,32] and MS [9,29,30,32]. The libOMM library is used to perform the minimization of the energy functional, while the MS library serves as an interface to low-level algebraic routines, facilitating switching between different matrix formats. These libraries have been extended to make possible not only cubic-scaling but also linear-scaling OMM calculations for insulating systems with a substantial band gap. The energy functional minimization in libOMM can now be carried out for sparse matrices with the DBCSR library [24-26], in addition to dense matrices using ScaLAPACK [7]. To facilitate the incorporation of libOMM into electronic structure codes based on atomic orbitals, MS has also been supplemented with subroutines for matrix format conversion and matrix reading and writing. The libOMM solver library can easily be developed further in the MS language for the implementation of new solvers.
The extended MS and libOMM libraries, available through ESL [9,31], can be used for the implementation of linear- and cubic-scaling OMM approaches in other codes. The libraries can be used with different types of local basis sets. The only condition for achieving the linear-scaling behaviour is that either the basis functions go to zero beyond some cut-off radius or the elements of the input matrices are filtered with respect to some tolerance to ensure that the matrices are sparse. Note that the implementation of custom conversion routines is needed if the matrix format is different from the MS or SIESTA formats.
To test the performance of the new OMM and the traditional diagonalization solvers available in SIESTA, large-scale calculations have been performed for a BN layer. When sparse matrices and localized wave functions are used, linear scaling with system size is achieved in practice, as expected. Matrix conversion, reading and writing of restart files, as well as initialization and update of the localized wave functions, take a small fraction of the computational time. For the linear-scaling methods, that fraction does not depend on the system size. The cubic-scaling OMM with kinetic energy preconditioning performs best for small systems, even better than diagonalization. For plain OMM, diagonalization and cubic-scaling OMM with kinetic energy preconditioning, the crossovers with the linear-scaling methods are observed at about 300, 700 and 1200 atoms, respectively. The best performance for the linear-scaling OMM with sparse matrices is achieved when the wave functions and basis functions are divided into blocks of sizes around 10, taking into account the chemical structure. The OMM solver is MPI-parallelized. When using the DBCSR library [24-26] for algebraic operations with sparse matrices, the computational time decreases by a factor of 2.5 upon doubling the computational cost. It is expected that the CPU scaling can be further improved via refactoring of some operations in the libOMM library, the use of OpenMP and GPUs, etc.
To perform OMM calculations from scratch, it is recommended to start using a small linear-mixing parameter (down to 0.01), a small step for geometry optimization, and cut-off radii for the wave functions of a few Å. For the cubic-scaling OMM, the convergence becomes much faster with kinetic energy preconditioning or Cholesky factorization. The extension of these approaches to sparse matrices demands further investigation.